Okay, so today I would like to try to bridge two topics that are quite closely related, but are perhaps seen as distinct for historical reasons: optimal control and reinforcement learning. I guess many of you are much more acquainted with optimal control, so at the beginning of my talk I will try to explain very briefly what I mean by reinforcement learning, what reinforcement learning is, and then I will try to make the bridge.
Okay, so the goal of reinforcement learning is basically this. Suppose that you are an agent, pictured here as a brain, and this agent has to move or make decisions in an environment that is unknown or only partially known. From the environment the agent gets back observations and a reward, which can be positive or negative. For example, if you have a dog that you are trying to train, you want it to do something, so you give it a positive reward when it behaves the way you want and a negative reward when it behaves the way you don't want.
Okay, so reinforcement learning is one of the three paradigms of machine learning: supervised learning, unsupervised learning, and reinforcement learning. It is basically the third one, and it has achieved some notable successes in the last, I would say, fifteen years. One of them: I don't know if you play chess or any of these online games, but they all now come with artificial intelligence programs, engines like Stockfish and so on, and many of the strongest ones are based on reinforcement learning techniques. One of the first of this kind was AlphaGo, later followed by AlphaZero for chess, but there were also others before. AlphaGo was extremely successful because Go is in some sense an even more complicated game than chess, with a huge number of legal states, and AlphaGo was the first artificial intelligence able to beat a human champion at this game. This happened in 2016, and the result was published by David Silver and co-authors in Nature. It was a huge success because it came years ahead of schedule: predictions based on the growth of available computing power had suggested that this might take at least another decade, and the fact that it was achieved with a more reasonable amount of computation made it a great achievement.
Okay, but the link between reinforcement learning and optimal control has actually been well known since the early days of reinforcement learning. If you try to learn something about reinforcement learning, I guess you will run into the monograph by Sutton and Barto, and there you can find a reference to a paper from the control community, from 1992 I think, co-authored by Sutton, Barto, and Williams, whose title says everything: reinforcement learning is direct adaptive optimal control. What does direct adaptive optimal control mean? Well, basically, reinforcement learning is nothing else than trying to learn the policy that the agent has to pursue without learning the model behind it. That is basically the main goal of reinforcement learning, and in this sense it is a direct form of adaptive optimal control. Now, just to give you the flavor of what a reinforcement learning algorithm looks like, let me show you, for instance, one of the simplest reinforcement learning algorithms, which is called Q-learning.
This was published, I think in 1989, by Watkins in his PhD thesis. Now I'm using notation that is maybe a little more familiar to mathematicians, but basically: suppose you are given a reward, an initial state, and an initialization of the so-called action-value function. Suppose that the agent moves according to a certain policy. This policy can be a greedy policy like this one, or an epsilon-greedy policy, meaning that it is greedy with a certain probability and random with the complementary probability. It could also be purely random.
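As a sketch in standard notation (the state x, action u, and exploration parameter ε are the usual symbols and are not taken from the speaker's slides), the epsilon-greedy policy just described can be written as:

```latex
% Epsilon-greedy policy built from the current action-value estimate Q:
% exploit (be greedy) with probability 1 - \varepsilon, explore at random otherwise.
\pi_{\varepsilon}(x) =
\begin{cases}
\arg\max_{u} Q(x,u) & \text{with probability } 1-\varepsilon,\\
\text{a uniformly random action} & \text{with probability } \varepsilon.
\end{cases}
```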
Here Q is the action-value function, meaning that it is a function which, when you maximize with respect to u, gives the value function. That is what they do to build Q: they start from an initialization. They say Q is...
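The slide itself is not reproduced in the transcript; as a sketch in standard notation, the relation between Q and the value function, together with Watkins' update rule that the initialization feeds into, reads as follows (the learning rate α and discount factor γ are standard assumptions, not taken from the slide):

```latex
% The value function is obtained by maximizing the action-value function over actions.
V(x) = \max_{u} Q(x,u).

% Watkins' Q-learning update after observing a transition (x, u, r, x'):
Q(x,u) \;\leftarrow\; Q(x,u) + \alpha \Bigl( r + \gamma \max_{u'} Q(x',u') - Q(x,u) \Bigr).
```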
[Answering a question from the audience] No, the model is not given. What you have is just... You assume that at each stage you can observe where you are. You assume that if I am here and I move like this, then I am here, and if I then move like this, I am here. That is the only thing you assume. Basically, you assume that at the beginning you have a table of states and actions, and in each entry of this table you have the value of Q for that state-action pair.
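The transcript breaks off here. As a concrete companion to this description, the following is a minimal sketch of tabular Q-learning with an epsilon-greedy policy; the environment interface (reset and step returning the observed next state, reward, and a termination flag) and all parameter values are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def q_learning(env, n_states, n_actions,
               episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy.

    `env` is assumed to expose `reset() -> state` and
    `step(action) -> (next_state, reward, done)`; no model of the
    dynamics is ever used, only observed transitions and rewards.
    """
    # The table of states and actions: entry Q[x, u] estimates the
    # action-value of taking action u in state x.
    Q = np.zeros((n_states, n_actions))

    for _ in range(episodes):
        x = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: greedy with probability 1 - epsilon,
            # uniformly random with probability epsilon.
            if np.random.rand() < epsilon:
                u = np.random.randint(n_actions)
            else:
                u = int(np.argmax(Q[x]))

            x_next, r, done = env.step(u)

            # Watkins' update: move Q[x, u] toward the observed reward plus
            # the discounted best value achievable from the next state.
            target = r + gamma * np.max(Q[x_next]) * (not done)
            Q[x, u] += alpha * (target - Q[x, u])

            x = x_next

    return Q
```

Under standard step-size and exploration conditions, the table Q converges to the optimal action-value function, and its greedy policy solves the underlying optimal control problem without ever identifying the model, which is exactly the "direct" character described above.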
Presenter
Prof. Michele Palladino
Access
Open access
Duration
00:35:01 min
Recording date
2025-04-28
Uploaded on
2025-04-30 09:37:20
Language
en-US